Implement refscan-based real-time referential integrity checking on /metadata/json:validate #835

@eecavanna eecavanna commented Dec 12, 2024

In this branch, I implemented real-time referential integrity checking on the /metadata/json:validate API endpoint. I also added automated tests that target the validate_json function.

Details

Real-time referential integrity checking

Previously, the validate_json function — the function that underlies the /metadata/json:validate API endpoint — did not validate inter-document references.

On this branch, I updated that function so that it does validate inter-document references, provided the caller opts in by passing True for the newly introduced parameter, check_inter_document_references: bool = False. The check is off by default.
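
As a rough illustration of the opt-in pattern described above, here is a minimal sketch. Only the parameter name check_inter_document_references and its default come from this PR. The function name validate_json_sketch, the existing_ids argument, the "refs" field, the returned dictionary shape, and the helper find_dangling_references are all hypothetical stand-ins. In particular, the helper is not the refscan API, and running the new stage only after the earlier stage passes is an assumption of this sketch.

```python
from typing import Any, Dict, List, Set

# Hypothetical shape for this sketch: collection name -> list of documents.
Docs = Dict[str, List[Dict[str, Any]]]


def find_dangling_references(docs: Docs, existing_ids: Set[str]) -> List[str]:
    """Stand-in for the real reference check; this is not the refscan API."""
    known_ids = set(existing_ids)
    for collection in docs.values():
        known_ids.update(doc["id"] for doc in collection)
    return [
        f"{name}: '{doc['id']}' references missing document '{ref}'"
        for name, collection in docs.items()
        for doc in collection
        for ref in doc.get("refs", [])
        if ref not in known_ids
    ]


def validate_json_sketch(
    docs: Docs,
    existing_ids: Set[str],
    check_inter_document_references: bool = False,
) -> Dict[str, Any]:
    """Validate documents; check references only when the caller opts in."""
    # Stand-in for the preexisting validation: every document needs an "id".
    errors = [
        f"{name}: document at index {i} has no 'id'"
        for name, collection in docs.items()
        for i, doc in enumerate(collection)
        if "id" not in doc
    ]
    # Assumption: the new stage runs only if the earlier stage passed
    # and the caller opted in.
    if not errors and check_inter_document_references:
        errors.extend(find_dangling_references(docs, existing_ids))
    return {"ok": not errors, "errors": errors}


docs = {"biosample_set": [{"id": "b1", "refs": ["s1"]}]}
# Default behavior: references are not checked, so the dangling "s1" goes unnoticed.
print(validate_json_sketch(docs, existing_ids=set()))
# Opted in: the dangling reference is reported.
print(validate_json_sketch(docs, existing_ids=set(), check_inter_document_references=True))
```

Keeping the default at False means existing callers keep their current behavior unless they explicitly pass True.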

Based upon recent conversations with stakeholders, I "opted in" only the /metadata/json:validate API endpoint to this new validation. I did not "opt in" the /metadata/json:submit API endpoint. That is so people can continue to run JSON-submitting code (i.e. API clients) that creates referring documents before creating the documents they refer to, a sequence of events that leaves the database without full referential integrity for a period of time.
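
To make that sequence of events concrete, here is a small sketch using an in-memory dictionary as a stand-in for the database. The names database, submit, and dangling_references, as well as the document ids, are hypothetical and exist only for this illustration; nothing here reflects the actual endpoint code.

```python
from typing import Dict, List, Set

# Hypothetical in-memory "database": document id -> ids that the document references.
database: Dict[str, List[str]] = {}


def submit(documents: Dict[str, List[str]]) -> None:
    """Store documents without checking their references (i.e. not opted in)."""
    database.update(documents)


def dangling_references() -> Set[str]:
    """Return the referenced ids that do not exist in the database."""
    return {ref for refs in database.values() for ref in refs if ref not in database}


# The client submits the referring document first...
submit({"biosample-1": ["study-1"]})
after_first_submission = dangling_references()

# ...and the document it refers to afterwards.
submit({"study-1": []})
after_second_submission = dangling_references()

print(after_first_submission, after_second_submission)
```

Between the two submissions the database holds a reference to a document that does not exist yet. If the submit step enforced the check, the first submission would be rejected.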

The core referential integrity checking is done by a function imported from the refscan PyPI package. On this branch, I added refscan to requirements/main.in and then ran $ make update-deps.

Tests targeting the validate_json function

Previously, there were no unit tests targeting the validate_json function. I added some on this branch.

In addition to targeting the original behavior of the function, which did not check inter-document references, the tests also target the new behavior of the function, where it checks inter-document references.

Related issue(s)

#831

Related subsystem(s)

  • Runtime API (except the Minter)
  • Minter
  • Dagster
  • Project documentation (in the docs directory)
  • Translators (metadata ingest pipelines)
  • MongoDB migrations
  • Other

Testing

  • I tested these changes (explain below)
  • I did not test these changes

I tested these changes by implementing and running unit tests that target them.

Documentation

  • I have not checked for relevant documentation yet (e.g. in the docs directory)
  • I have updated all relevant documentation so it will remain accurate
  • Other (explain below)

Maintainability

  • Every Python function I defined includes a docstring (test functions are exempt from this)
  • Every Python function parameter I introduced includes a type hint (e.g. study_id: str)
  • All "to do" or "fix me" Python comments I added begin with either # TODO or # FIXME
  • I used black to format all the Python files I created/modified
  • The PR title is in the imperative mood (e.g. "Do X") and not the declarative mood (e.g. "Does X" or "Did X")

@eecavanna eecavanna self-assigned this Dec 12, 2024
eecavanna and others added 3 commits December 17, 2024 20:20
Previously, we were checking it after inserting documents into _each_
specified collection, which made it so we would not know whether a
referenced document would have been inserted into a later collection.
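
The difference between the two orderings can be sketched as follows. This is a simplified illustration, not the code from this branch: the function names, the "refs" field, and the sample documents are hypothetical, and a plain set of ids stands in for the database.

```python
from typing import Any, Dict, List, Tuple

Docs = Dict[str, List[Dict[str, Any]]]


def check_after_each_collection(docs: Docs) -> List[Tuple[str, str]]:
    """Check references right after each collection is "inserted"."""
    seen = set()
    violations = []
    for collection in docs.values():
        seen.update(doc["id"] for doc in collection)
        for doc in collection:
            for ref in doc.get("refs", []):
                # Documents in later collections have not been seen yet.
                if ref not in seen:
                    violations.append((doc["id"], ref))
    return violations


def check_after_all_collections(docs: Docs) -> List[Tuple[str, str]]:
    """Check references once, after every collection has been "inserted"."""
    seen = {doc["id"] for collection in docs.values() for doc in collection}
    return [
        (doc["id"], ref)
        for collection in docs.values()
        for doc in collection
        for ref in doc.get("refs", [])
        if ref not in seen
    ]


# The referring collection is listed before the collection it refers to.
docs = {
    "biosample_set": [{"id": "b1", "refs": ["s1"]}],
    "study_set": [{"id": "s1"}],
}
print(check_after_each_collection(docs))  # reports ("b1", "s1") as a violation
print(check_after_all_collections(docs))  # reports no violations
```

Checking once at the end avoids flagging a reference whose target arrives in a later collection of the same payload.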

eecavanna commented Jan 14, 2025

The failing test involves the use of the string "@type" as a collection name and a raw string (not a list of dictionaries) as a collection value. The validate_json function seems to have been designed to accept that. I don't know why / I don't know the background here. 🤷 I have started a discussion about it here: #858

For now, I'll update the referential integrity checking code to ignore incoming collections that are not lists.
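
A minimal sketch of that filtering step, assuming the incoming payload is a dictionary that maps collection names to values. The function name select_checkable_collections and the sample payload (including the "@type" value) are hypothetical.

```python
from typing import Any, Dict, List


def select_checkable_collections(in_docs: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Keep only the entries whose value is a list, skipping e.g. raw strings."""
    return {name: value for name, value in in_docs.items() if isinstance(value, list)}


payload = {
    "@type": "some string",  # hypothetical non-list value, skipped by the check
    "study_set": [{"id": "s1"}],
}
checkable = select_checkable_collections(payload)
print(checkable)
```

With this filter in place, the reference check never iterates over a string as if it were a list of documents.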

eecavanna and others added 5 commits January 13, 2025 18:52
I don't know the use case here. It is a use case that the preexisting
validation stages allowed for. I do not see any documentation about it.
@eecavanna eecavanna marked this pull request as ready for review January 14, 2025 03:26
@eecavanna eecavanna requested a review from aclum January 14, 2025 03:26
@eecavanna eecavanna requested review from dwinston and shreddd January 14, 2025 03:26
@eecavanna eecavanna merged commit d3e146b into main Jan 14, 2025
2 checks passed
@eecavanna eecavanna deleted the 831-implement-refscan-based-real-time-referential-integrity-validation-in-runtime branch January 14, 2025 20:51
Development

Successfully merging this pull request may close these issues.

Implement refscan-based real-time referential integrity validation in Runtime
3 participants